Skip to content

Comments

swev-id: scikit-learn__scikit-learn-13124 StratifiedKFold shuffle fix#49

Open
casey-brooks wants to merge 2 commits intoscikit-learn__scikit-learn-13124from
noa/issue-21
Open

swev-id: scikit-learn__scikit-learn-13124 StratifiedKFold shuffle fix#49
casey-brooks wants to merge 2 commits intoscikit-learn__scikit-learn-13124from
noa/issue-21

Conversation

@casey-brooks
Copy link

@casey-brooks casey-brooks commented Dec 26, 2025

Summary

Reproduction Steps

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0]*20 + [1]*20)
X = np.zeros_like(y)

skf0 = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skf1 = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

idx0 = np.where(y == 0)[0]
idx1 = np.where(y == 1)[0]

perm0_c0, perm0_c1 = [], []
for _, test_idx in skf0.split(X, y):
    perm0_c0.extend([np.where(idx0 == i)[0][0] for i in test_idx if y[i] == 0])
    perm0_c1.extend([np.where(idx1 == i)[0][0] for i in test_idx if y[i] == 1])

perm1_c0, perm1_c1 = [], []
for _, test_idx in skf1.split(X, y):
    perm1_c0.extend([np.where(idx0 == i)[0][0] for i in test_idx if y[i] == 0])
    perm1_c1.extend([np.where(idx1 == i)[0][0] for i in test_idx if y[i] == 1])

print('seed=0 class0 permutation:', perm0_c0)
print('seed=0 class1 permutation:', perm0_c1)
print('seed=1 class0 permutation:', perm1_c0)
print('seed=1 class1 permutation:', perm1_c1)
print('perm0_c0 == perm0_c1?', perm0_c0 == perm0_c1)
print('perm1_c0 == perm1_c1?', perm1_c0 == perm1_c1)

Observed Failure (pre-fix)

  • Prior to this change, on the target branch, the above script printed True for both equality checks, indicating coordinated permutations across classes (identical within-stratum shuffles) even with different seeds. There is no exception; the failure is behavioral and reproducibility-related.
  • Example (abridged) output before fix:
seed=0 class0 permutation: [7, 18, 5, 10, 0, ...]
seed=0 class1 permutation: [7, 18, 5, 10, 0, ...]
seed=1 class0 permutation: [13, 2, 16, 6, 9, ...]
seed=1 class1 permutation: [13, 2, 16, 6, 9, ...]
perm0_c0 == perm0_c1? True
perm1_c0 == perm1_c1? True

Fix and Post-Fix Behavior

  • After this fix, per-class permutations are independent:
perm0_c0 == perm0_c1? False
perm1_c0 == perm1_c1? False
  • New tests added:
    • test_stratifiedkfold_shuffle_independent_per_class_permutation
    • test_stratifiedkfold_shuffle_different_seeds_change_permutations
    • test_stratifiedkfold_no_shuffle_unchanged

Local Test Logs

  • Environment: conda-forge micromamba (AArch64), Python 3.8, NumPy 1.21.6, SciPy 1.7.3, Cython 0.29.36.
  • Commands executed (no CI):
python setup.py build_ext -i
pip install -e . --no-build-isolation
pytest -q sklearn/model_selection/tests/test_split.py::test_shuffle_stratifiedkfold_reproducibility
pytest -q sklearn/model_selection/tests/test_split.py::test_stratifiedkfold_shuffle_independent_per_class_permutation
pytest -q sklearn/model_selection/tests/test_split.py::test_stratifiedkfold_shuffle_different_seeds_change_permutations
pytest -q sklearn/model_selection/tests/test_split.py::test_stratifiedkfold_no_shuffle_unchanged
  • Results: all targeted tests passed locally. No CI run was performed.

Notes

  • This change intentionally alters split assignments for equal-sized classes when shuffle=True and random_state is an integer, to align with documentation and expected semantics (shuffle within each stratification independently).

@casey-brooks
Copy link
Author

Test & Lint Summary

  • SKLEARN_SITE_JOBLIB=1 /workspace/bin/micromamba run -n skl021 pytest sklearn/model_selection/tests/test_split.py::test_shuffle_stratifiedkfold
  • /workspace/bin/micromamba run -n skl021 flake8 --extend-ignore=E741 sklearn/model_selection/_split.py sklearn/model_selection/tests/test_split.py

@rowan-stein rowan-stein changed the base branch from main to scikit-learn__scikit-learn-13124 December 26, 2025 04:28
@rowan-stein rowan-stein requested a review from a team December 26, 2025 04:28
Copy link

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix looks good, but the regression coverage does not hit the original failure. Please address the feedback.

kf2 = StratifiedKFold(5, shuffle=True, random_state=1)
test_set1 = sorted([tuple(s[1]) for s in kf1.split(X, y)])
test_set2 = sorted([tuple(s[1]) for s in kf2.split(X, y)])
assert test_set1 != test_set2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] The new regression check never exercises the original failure. On the buggy branch, and already yield different split assignments, so this assertion passes even without the fix. The behavioral bug was that within a single seed, the per-class permutations were coordinated (equal-sized strata received identical permutations). We need a test that fails prior to the fix by verifying, for example, that for a balanced label vector, the sequence of class-0 permutations differs from the sequence of class-1 permutations when and is an int. Could you rewrite this regression test to capture that scenario (and keep the seed-vs-seed check if you still find it valuable)?}

Copy link

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update—the per-class regression test now catches the original failure, and the rest of the changes look solid. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants